{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear Regression\n",
    "In this notebook, we will walk you through all the examples of linear regression models using the using the **ScikitLearn.jl** package. \n",
    "\n",
    "In order to successfully run these examples, it is required that you have installed both the **ScikitLearn** and the **Plots** Julia packages. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:red\"><b>Note!</b></span> \n",
    "\n",
    "The package Plots is NOT necessary for running a Linear Regression model! In this notebook, we use it with the unique purpose of easily visualizing the approximation of Linear Regression models to the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Linear Regression Models\n",
    "using ScikitLearn\n",
    "\n",
    "# Visualization\n",
    "using Plots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simple Linear Regression\n",
    "A simple linear regression is a linear regression model with a single dependent variable and a single independent variable, i.e., it is a two-dimensional model where the dependent variable `y` is described by a single variable `x`.\n",
    "\n",
    "In the following section we see how to create a simple linear regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 50 1-dimensional samples\n",
    "X = 5 * rand(50, 1)\n",
    "y = sin.(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that our matrix `X` has dimension `n_samples x n_features` as is required by the [ScikitLearnBase API](https://scikitlearnjl.readthedocs.io/en/latest/api/). Same for `y`.\n",
    "\n",
    "Let's confirm that!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "println(\"size(X) == (50, 1): $(size(X) == (50, 1))\")\n",
    "println(\"size(y) == (50, 1): $(size(y) == (50, 1))\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In real life applications, our data usually is associated with some noise. Let's add some noise to our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Every 5 points are shifted by a random amount\n",
    "y[1:5:end] += 5 * (0.5 .- rand(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it's time to create the model that we want to approximate our data with! In this case, we will use a **LinearRegression** model. Since we only have one dependent variable (`y`), we will set `multi_output=false`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create model\n",
    "lr = ScikitLearn.Models.LinearRegression(multi_output=false)\n",
    "\n",
    "# Train the model\n",
    "fit!(lr, X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the trained model to predict values of y\n",
    "y_pred = predict(lr, X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Plots\n",
    "scatter(X, y, title=\"Simple Linear Regression Model\") # Plot initial Samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot!(X, y_pred, label=\"Linear Regression\") # Plot our approximation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multi Linear Regression\n",
    "A Multi linear regression is a linear regression model with more than one independent variable and a single dependent variable, i.e., it is an (n+1)-dimensional model where the dependent variable `y` is described by a `n` variables (`x_1, x_2, ..., x_n`).\n",
    "In the following section we see how to create a multi-linear regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 50 2-dimensional samples\n",
    "X = rand(50, 2)\n",
    "y = mapslices(x -> x[1] - x[2], X, dims=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that our matrix `X` now has dimension `n_samples x 2`, and that `y`'s dimension remains `n_samples x 2`.\n",
    "\n",
    "Let's confirm that!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "println(\"size(X) == (50, 2): $(size(X) == (50, 2))\")\n",
    "println(\"size(y) == (50, 1): $(size(y) == (50, 1))\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it's time to create the model that we want to approximate our data with! In this case, we will use a **LinearRegression** model. Since we only have one dependent variable (`y`), we will set `multi_output=false`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create model\n",
    "lr = ScikitLearn.Models.LinearRegression(multi_output=false)\n",
    "\n",
    "# Train the model\n",
    "fit!(lr, X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the trained model to predict values of y\n",
    "y_pred = predict(lr, X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Plots\n",
    "scatter(X, y, title=\"Multi Linear Regression Model\", labels=[\"Original x_1\", \"Original x_2\"]) # Plot initial Samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scatter!(X, y_pred, labels=[\"Linear Regression x_1\" \"Linear Regression x_2\"]) # Plot our approximation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you observe closely, the predicted points lie exactly on top of the original sampled points.\n",
    "Now, we are going to analyse the relation of the dependent variable with the independent variable separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x1 = X[:, 1]\n",
    "p1 = scatter(x1, y, title=\"Linear Regression X_1\")\n",
    "scatter!(p1, x1, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x2 = X[:, 2]\n",
    "p2 = scatter(x2, y, title=\"Linear Regression X_2\")\n",
    "scatter!(p2, x2, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = plot(p1, p2, layout=(2, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multi-Target Regression\n",
    "A Multi target regression is a linear regression model with at least one independent variable and more than one dependent variable, i.e., it is an (n+m)-dimensional model with `m` dependent variables (`y_1, y_2, ..., y_n`) being described by some functions of `n` variables (`x_1, x_2, ..., x_n`).\n",
    "In the following section we see how to create a multi-target regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 5 1-dimensional samples\n",
    "X = [1, 2, 3, 4, 5][:,:]\n",
    "y = [[1, 2, 3, 4, 5] [-1, -2, -3, -4, -5]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "println(\"size(X) == (5, 1): $(size(X) == (5, 1))\")\n",
    "println(\"size(y) == (5, 2): $(size(y) == (5, 2))\") # We now have two targets, i.e., two dependent variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create model\n",
    "lr = ScikitLearn.Models.LinearRegression(multi_output=true) # Now we have a multi_output problem!!\n",
    "\n",
    "# Train the model\n",
    "fit!(lr, X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the trained model to predict values of y\n",
    "y_pred = predict(lr, X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, because we did not add any noise and because the dependent variables are **linear functions** of the independent variable `x`, we expect the predict y to be very close to the original `Y` variables. To confirm this we will define a `close-enough` method that determines whether a given value `x0` is close enough of the value `x1` for a given tolerance (`tol`). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "close_enough(x0, x1, tol=1e-14) = abs(x0 - x1) <= tol ? true : false"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To validate our initial idea that y_pred will be approximately equal to y, we expect the result of mapping the function close_enough to all elements of t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all(map(close_enough, y_pred, y)) "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.0",
   "language": "julia",
   "name": "julia-1.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
